Abstract

Sparse representation classification (SRC) is one of the most promising classification methods for supervised learning. This
method can effectively exploit discriminating information by introducing an $\ell_1$ regularization term to the data. With the
desirable property of sparsity, SRC is robust to both noise and outliers. In this study, we propose a weighted meta-sample
based non-parametric sparse representation classification method for the accurate identification of tumor subtypes. The
proposed method includes three steps. First, we extract the weighted meta-samples for each subclass from raw data, and
the rationality of the weighting strategy is proven mathematically. Second, sparse representation coefficients are
obtained by $\ell_1$ regularization of underdetermined linear equations. Thus, data-dependent sparsity can be adaptively tuned.
Third, a simple characteristic function is utilized to achieve classification. Asymptotic time complexity analysis is applied
to our method. Compared with some state-of-the-art classifiers, the proposed method has lower time complexity and more
flexibility. Experiments on eight publicly available gene expression profile datasets show the effectiveness of the
proposed method.
Introduction

The development of high-throughput technologies has enabled 
scientists to monitor the gene expression levels in tens of thousands 
of genes simultaneously in a single experiment. This technology 
has become a symbol of the post-genomic era [1]. Biomedical 
research indicates that tumor development is related to the change 
in gene expression levels and that tumor-related biomarkers are 
usually associated with a few genes. Thus, identifying tumor tissue 
or disease-related biomarkers accurately is of great practical 
significance. However, gene expression profile data are characterized
by very high dimensionality and small sample size. The
curse of dimensionality problem makes classification challenging. 

Some dimensionality reduction methods have recently been 
proposed to solve the "large p, small n" problem [2]. Feature 
extraction and feature selection are two methods of dimensionality 
reduction; feature extraction transforms original features (genes) 
into a set of new features by subspace learning [3-5]. However, 
suitable biological interpretation is difficult to obtain from the 
subspace learning dimensionality reduction results. Feature 
selection is another commonly used dimensionality reduction 
method that selects a sub-set of genes that can best predict the 
response values from the raw data [6]. Although dimensionality 
reduction can significantly improve computational efficiency, this 
process can easily lead to over-fitting when a classifier is applied.



Sparse representation classification (SRC) was proposed by
Wright et al. [7] for face recognition. With an $\ell_1$ sparsity constraint, a
testing face can be approximately represented by parts of the
training data that are from the same class. Unlike traditional
classification methods such as the support vector machine and the k
nearest neighbor classifier, SRC is robust to both noise and
outliers. However, the original training samples may not contain
sufficient discriminating information compared with meta-samples
[8].

To capture more alternative information from gene expression 
data, the so-called meta-samples are proposed by [8-11]. These 
samples can be regarded as a set of bases, the linear representation 
of which can represent the training data. In [11], penalized matrix
decomposition is used to extract meta-samples, and clustering is 
performed on those meta-samples. In [8], the meta-sample based 
sparse representation classification (MSRC) method is proposed. 
This method is robust to over-fitting problem and noise. However, 
MSRC needs two predefined parameters, namely, the number of 
meta-samples and the sparse penalty factor. These two parameters 
are data dependent. Thus, model selection methods, such as cross- 
validation (CV), significantly affect the classification results. In this 
study, we propose a non-parametric version of MSRC to address 
this optimal parameter selection problem. The main contributions 
of this paper are as follows: 
Table 1. Notations and abbreviations used in this paper.

| Notation | Description |
|---|---|
| SVD | Singular value decomposition |
| $\mathbb{R}^N$ | N-dimensional real number vector |
| X | $X = \{x_1, x_2, \ldots, x_m\} \in \mathbb{R}^{n \times m}$ denotes gene expression data set with n genes, m samples |
| W | $W = [W_1, W_2, \ldots, W_c]$ meta-samples associated with c classes |
| $\lvert C_i \rvert$ | Number of samples belonging to class i |
| $\lVert \cdot \rVert_0$ | $\ell_0$ norm |
| $\lVert \cdot \rVert_1$ | $\ell_1$ norm |
| $\lVert \cdot \rVert_F$ | Matrix Frobenius norm |

doi:10.1371/journal.pone.0104314.t001



1. The data-dependent sparsity can be automatically adjusted,
rather than empirically chosen. Without computationally 
expensive model selection, our method is scalable and efficient. 

2. The existing MSRC [8] method requires the appropriate 
selection of the number of meta-samples for each sub class, 
which is a laborious task. We address this problem by 
introducing a simple weighting strategy for the meta-sample 
of each category, and the rationality of weighting strategies is 
mathematically proved. 

3. Extensive experiments are performed to evaluate the proposed 
method. Experimental results show the superiority of the non- 
parametric version of MSRC compared with some state-of-the- 
art classifiers. Section 3 presents more details. 

The remainder of this paper is organized as follows: prior work 
on sparse representation classification and the fundamentals of the 
proposed method are described in Section 2. Section 3 presents 
the experimental results. The proposed method is discussed in 
Section 4. Section 5 concludes this paper. 



Methods 

This study primarily aims to devise a robust classifier for tumor
subtype classification. Given a microarray data set
$X = \{x_1, x_2, \ldots, x_m\} \in \mathbb{R}^{n \times m}$ and a set of class
labels $C = \{1, 2, \ldots, c\}$, X is a matrix with n rows and m columns.
Each column of X denotes a sample, whereas each row of X
denotes a gene. Let $x_j$ denote the jth sample, which is an
n-dimensional column vector. For each element in X, $X_{i,j} \in \mathbb{R}$ denotes
the expression level of the ith gene in the jth sample. We provide a
summary of the abbreviations used in this study in Table 1. For
clarity, we use boldface lowercase letters for vectors and
boldface capital letters for matrices.

Gene expression profile data are high-throughput data with tens 
of thousands of genes. However, the number of samples is usually 
very small, which makes classification challenging. To avoid the 
curse of dimensionality, differential gene expression analysis 
[12,13] is widely used to exclude redundant and irrelevant genes 
before classification. In our study, we use the ReliefF [14] method
to select a subset of informative genes for further analysis. In the
following subsections, we briefly review meta-samples and sparse
representation classification; we then propose the weighted
meta-sample based parameter-free sparse representation classification
(PFMSRC) method.

Figure 1. Illustration of the meta-sample model: each column vector of $X_{n_i}$ can be represented as a linear combination of the
meta-samples in $W_{n_i}$, and each column of $H_{n_i}$ holds the corresponding linear combination coefficients.

doi:10.1371/journal.pone.0104314.g001

Figure 2. Optimal classification accuracy of MSRC achieved on COLON; the x-axis represents the number of meta-samples (left) and
the regularization parameter (right). Classification accuracy is more sensitive to the number of meta-samples than to the regularization
parameter.

Meta-samples versus gene expression samples 

As illustrated in Figure 1, meta-samples can be regarded as basis
samples that contain the essential information of the original data.
A given testing sample can be represented by a linear combination
of meta-samples from the same class. Concretely, suppose $x_i$ is
associated with the $n_i$th class, where $n_i \in C$, and the $n_i$th class
samples in the training data have k meta-samples, namely,
$\{w_1, w_2, \ldots, w_k\} \in \mathbb{R}^{n \times k}$. Sample $x_i$ can be formulated as Eq. (1).

$$x_i = w_1 h_{1,i} + w_2 h_{2,i} + \cdots + w_k h_{k,i} \qquad (1)$$

Mathematically, meta-sample extraction can be regarded as a
type of matrix decomposition, including non-negative matrix
factorization [15], singular value decomposition (SVD) [16], and
principal component analysis [17], where the matrices $W_{n_i} \in \mathbb{R}^{n \times k}$ and
$H_{n_i} \in \mathbb{R}^{k \times |n_i|}$ denote the meta-samples and meta-genes, respectively.
In singular value decomposition, $W_{n_i}$ is a maximal linearly
independent group of column vectors of $X_{n_i}$.
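
The linear-combination model of Eq. (1) can be illustrated with a minimal Python sketch. The function name `combine_meta_samples` and the toy numbers are hypothetical and are not taken from the original study; the sketch only shows how one sample is rebuilt from k meta-samples and its k coefficients.

```python
def combine_meta_samples(meta_samples, coefficients):
    """Return sum_j coefficients[j] * meta_samples[j] as a list of floats.

    meta_samples: list of k vectors (each a list of n gene values),
                  i.e. the columns w_1..w_k of W.
    coefficients: list of k weights h_{1,i}..h_{k,i} for one sample.
    """
    if len(meta_samples) != len(coefficients):
        raise ValueError("need one coefficient per meta-sample")
    n = len(meta_samples[0])
    sample = [0.0] * n
    for w, h in zip(meta_samples, coefficients):
        for g in range(n):
            sample[g] += h * w[g]
    return sample


# Hypothetical toy data: two meta-samples over three genes.
W_cols = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
x = combine_meta_samples(W_cols, [2.0, 3.0])
print(x)
```

In matrix form this is one column of the product of the meta-sample matrix and the coefficient matrix.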

Biologically, meta-samples are also called eigenarrays [18] or
basis snapshots for gene expression data. Han et al. [17] used
meta-samples to identify tumors from microarray data and found that
meta-sample-based classification can effectively avoid over-fitting.
Zheng et al. [10,11,18] proposed a novel clustering method based on
meta-samples, in which meta-samples can be regarded as cluster
indicators.

Prior works revealed that meta-samples preserve some desired 
discriminant information of samples from the same class. 

Sparse representation classification problem revisited 

In this subsection, we briefly revisit the sparse representation
problem. Sparse representation is one of the most important
tools in the machine learning and data mining community
and has wide applications in fields such as text mining, image
classification, and bioinformatics. In this work, we interpret the
sparse representation problem from the viewpoint of linear algebra.

From the standpoint of the linear equation system $X\alpha = y$, the
solution of $X\alpha = y$ has three possible states:

1. Linear equation systems have infinitely many solutions if they
are underdetermined (i.e., n<m).

2. Linear equation systems have a unique solution if they are well
posed.

3. Linear equation systems have no solution if overdetermined
(i.e., n>m).

In the first scenario, one can pursue the sparse solution by
regularization [19]. The problem can be formulated as
Figure 3. The flowchart of the PFMSRC scheme.

doi:10.1371/journal.pone.0104314.g003



PLOS ONE | www.plosone.org | August 2014 | Volume 9 | Issue 8 | e104314

Parameter Free Sparse Representation Classification for Microarray Data



Table 2. Descriptions of microarray data repository and the accession number.

| Datasets | Repository | Accession number |
|---|---|---|
| Colon | Gene Expression Omnibus | GDS4379 |
| Acute leukemia data | Gene Expression Omnibus | GSE19475 |
| DLBCL | Gene Expression Omnibus | GSE15177 |
| Gliomas | Gene Expression Omnibus | GSE54792 |
| SRBCT | Gene Expression Omnibus | GSE1825, GSE31186, GSE31217 |
| ALL | Gene Expression Omnibus | GSE23024 |
| MLLLeukemia | Gene Expression Omnibus | GSE11038 |
| LukemiaGloub | Gene Expression Omnibus | GSE10283 |

doi:10.1371/journal.pone.0104314.t002



$$\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.} \quad X\alpha = y \qquad (2)$$



However, $\ell_0$ norm minimization is an NP-hard combinatorial optimization
problem and is difficult to solve. Fortunately, the $\ell_1$ norm is an
appropriate convex approximation to the $\ell_0$ norm [20]. If the solution is
sparse enough, $\ell_1$ minimization is equivalent to $\ell_0$ minimization
[21], such that we can reformulate Eq. (2) as



$$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.} \quad X\alpha = y \qquad (3)$$



For the other two scenarios, the sparsity of a cannot be 
guaranteed. However, one can still obtain a sparse solution by 
adding a penalty term that shares the same formulation as LASSO 
[22] 



$$\min_{\alpha} \|X\alpha - y\|_2 + \lambda\|\alpha\|_1 \qquad (4)$$



Compared with Eq. (3), Eq. (4) is an unconstrained convex
problem. Notably, $\lambda$ makes a tradeoff between sparsity and
regression error and should be empirically chosen. A larger $\lambda$
yields a sparser $\alpha$. However, one might run the risk of increasing
the regression error term $\|X\alpha - y\|_2$.
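
The effect of the penalty factor can be illustrated with a minimal sketch. It assumes the standard squared-loss LASSO objective, 0.5 * ||Xa - y||_2^2 + lam * ||a||_1, and solves it with the iterative soft-thresholding algorithm (ISTA); neither the squared loss nor ISTA is stated in the original text, and the function names and toy system are hypothetical.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal map of t * |v|)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimize 0.5 * ||X a - y||_2^2 + lam * ||a||_1 by ISTA.

    X is a list of n rows, each with m entries; y has n entries.
    """
    n, m = len(X), len(X[0])
    # The squared Frobenius norm bounds the largest eigenvalue of X^T X,
    # so 1 / frob2 is a safe (if conservative) step size.
    frob2 = sum(v * v for row in X for v in row)
    step = 1.0 / frob2
    a = [0.0] * m
    for _ in range(n_iter):
        residual = [sum(X[i][j] * a[j] for j in range(m)) - y[i]
                    for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n))
                for j in range(m)]
        a = [soft_threshold(a[j] - step * grad[j], step * lam)
             for j in range(m)]
    return a


def count_nonzeros(a, tol=1e-8):
    """Count entries whose magnitude exceeds tol."""
    return sum(1 for v in a if abs(v) > tol)


# Hypothetical toy system: 2 equations, 3 unknowns (underdetermined).
X_toy = [[1.0, 0.0, 1.0],
         [0.0, 1.0, 1.0]]
y_toy = [1.0, 1.0]
a_small = lasso_ista(X_toy, y_toy, lam=0.01)
a_large = lasso_ista(X_toy, y_toy, lam=5.0)
print(count_nonzeros(a_small), count_nonzeros(a_large))
```

With a penalty larger than every entry of X^T y, soft-thresholding keeps all coefficients at zero, which is the extreme case of the sparsity versus regression-error tradeoff described above.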

Sparse representation assumes that a signal can be reconstruct- 
ed by a small number of basis signals within a linear combination. 
Thus, Eq. (3) is known as basis pursuit [23]. In bioinformatics
applications, one can suppose that a testing sample can be well 
reconstructed by the training data from the same class within a 
linear combination, which is a very useful assumption for our later 
work. 

Meta-sample based sparse representation 

Zheng et al. [8] proposed the MSRC method to predict tumor
subtypes. In this method, c classes of meta-samples are
extracted, denoted as $W = [W_1, W_2, \ldots, W_c]$ with the meta-samples
of the same class grouped together, where meta-samples are column vectors
(two kinds of meta-samples are proposed in [8]). Given a test
sample y associated with class i, MSRC tries to find sparse
reconstruction coefficients in terms of all meta-samples using Eq. (4).
In particular, [8] tries to solve the sparse representation problem
using $\min_{\alpha} \|W\alpha - y\|_2 + \lambda\|\alpha\|_1$. In ideal cases, the nonzero entries
in $\alpha$ will only be associated with the ith class meta-samples of W, as
shown in Eq. (5).



$$\alpha = [0, \ldots, 0, \underbrace{\alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,k}}_{i\text{th class}}, 0, \ldots, 0]^T \qquad (5)$$



Notably, gene expression profile data have high
dimensionality and small sample size ($n \gg m$). The sparsity can only
be achieved by adding a penalty term. However, the optimal
number of meta-samples and the penalty factor $\lambda$ are critical
in classification applications. Figure 2 illustrates that if
the number of meta-samples is improperly set, the prediction accuracy of
MSRC drops seriously on the COLON dataset. Specifically, the left
part of Figure 2 shows the 10-fold stratified cross-validation
classification accuracy achieved by varying the number of
meta-samples from 3 to 12 for each subclass. From the right part of
Figure 2, we can observe that the performance is less sensitive to the
regularization parameter $\lambda$ within the tested range. Thus, model
selection is essential and laborious work on different data sets.

To overcome this weakness, this study proposes a novel
parameter-free meta-sample based sparse representation
classification (PFMSRC) method.

Parameter free meta-sample sparse representation 
(PFMSRC) 

In this subsection, we first propose a heuristic weighting strategy,
the reasonableness of which is theoretically proven. We then
construct an underdetermined linear equation system, in which
the data-dependent sparsity can be self-adaptively tuned by the $\ell_1$
norm regularizer.

Let $X = \{X_1, X_2, \ldots, X_c\} \in \mathbb{R}^{n \times m}$ be gene expression profile data,
with the samples of the same class grouped together, that is, $X_i$ contains
all samples associated with the ith class. We factorize $X_i$ by
performing SVD. The singular values are sorted in descending
order $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k > 0$, where k is the column rank of $X_i$, and
$\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$ denotes the diagonal matrix with the singular values
as diagonal elements. One can extract the weighted meta-samples
associated with class i as $W_i = [\sqrt{\lambda_1}u_1, \sqrt{\lambda_2}u_2, \ldots, \sqrt{\lambda_k}u_k]$, where $u_j$
is a column vector in $U_i$, and $\mathrm{rank}(X_i) = k$.
Table 3. Data set descriptions.

| Datasets | Samples | Genes | Subclass number |
|---|---|---|---|
| Colon | 62 | 2000 | 2 |
| Acute leukemia data | 72 | 5000 | 2 |
| DLBC | 77 | 7129 | 2 |
| Gliomas | 50 | 12625 | 2 |
| SRBCT | 83 | 2308 | 4 |
| ALL | 248 | 12626 | 6 |
| MLLLeukemia | 72 | 12582 | 3 |
| LukemiaGloub | 72 | 7129 | 3 |

doi:10.1371/journal.pone.0104314.t003



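$$X_i = U_i \begin{pmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_k \end{pmatrix} V_i^T, \quad \forall j,\ \lambda_j > 0 \qquad (6)$$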



Alternatively, Eq. (6) can be compactly reformulated as
$X_i = U_i \sqrt{\Lambda}\sqrt{\Lambda} V_i^T$. This weighting scheme can enhance the
influence of the main singular vectors in $U_i$. That is, a larger $\lambda_j$ makes
the associated meta-sample more important. Moreover, the
weighting scheme works well in the following experiments.
Zheng et al. [8] also extracted meta-samples by
performing SVD. However, in their algorithm framework,
the number of meta-samples used for classification is determined
during the cross-validation step. In contrast, PFMSRC avoids
the cross-validation step by weighting all the meta-samples
and weakening the influence of minor singular vectors rather than
using only several of them for classification. Proposition 1 theoretically
justifies the weighting strategy in measuring
the importance of each meta-sample.
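
As a small worked illustration of the weighting (the matrix below is a hypothetical toy example, not data from this study), consider a class with two samples over three genes whose SVD can be read off directly:

```latex
X_i =
\begin{pmatrix} 3 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}
= \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}}_{U_i}
  \underbrace{\begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}}_{\Lambda}
  \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}_{V_i^T},
\qquad
W_i = [\sqrt{3}\,u_1,\ \sqrt{1}\,u_2]
    = \begin{pmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}.
```

Both singular vectors are kept, but the first is scaled by $\sqrt{3} \approx 1.73$ and the second by 1, so the dominant direction carries more weight in the dictionary without any vector being discarded.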

Proposition 1. Singular value is a reasonable weighting factor 
for measuring the importance of meta-samples. 



Proof. Let $X = [u_1, u_2, \ldots, u_k]\Lambda[v_1, v_2, \ldots, v_k]^T$, where
$\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$ and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k > 0$, so that $X = \sum_i \lambda_i u_i v_i^T$.
Considering the evaluation metric function

$$\frac{\|\lambda_i u_i v_i^T\|_F^2}{\|X\|_F^2} = \frac{Tr(\lambda_i^2 u_i v_i^T v_i u_i^T)}{Tr(XX^T)} = \frac{\lambda_i^2}{\lambda_1^2 + \lambda_2^2 + \cdots + \lambda_k^2},$$

we conclude that

$$\frac{\|\lambda_i u_i v_i^T\|_F}{\|X\|_F} \propto \lambda_i.$$

This completes the proof. □

Figure 4. Comparison of prediction accuracy on four binary classification datasets by varying the number of samples per
subclass; when p is larger than 10, the prediction accuracy of the model-based methods decreases as p increases.

doi:10.1371/journal.pone.0104314.g004

Table 4. Comparison on four binary classification tumor data sets; for each data set, 10 samples per class are randomly selected for
training and the rest are used for testing.

| Dataset name | LDA+SVM | ICA+SVM | SRC | MSRC-SVD | PFMSRC |
|---|---|---|---|---|---|
| colon | 74(±7.85) | 64.55(±7.39) | 84.20(±3.65) | 84.20(±4.81) | 85.45(±3.33) |
| DLBC | 66.76(±6.67) | 68.33(±4.78) | 86.49(±3.39) | 85.35(±4.91) | 86.40(±5.69) |
| Gliomas | 65.83(±8.08) | 69.83(±9.52) | 75.00(±6.35) | 75.83(±7.24) | 77.00(±6.48) |
| Acute leukemia | 89.71(±3.14) | 89.13(±4.96) | 93.46(±3.82) | 94.52(±3.65) | 96.25(±2.20) |

We report the standard deviations in parentheses.
doi:10.1371/journal.pone.0104314.t004

The evaluation metric function is used to measure the contribution
of each meta-sample to the reconstruction of the raw data
X. Tr denotes the matrix trace. Note that the
functions $f(x) = x$ and $g(x) = \sqrt{x}$ have the same monotonicity,
which makes the weighting strategy reasonable.
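
The quantities in Proposition 1 are simple to compute once the singular values are known. The following minimal sketch, with hypothetical function names and toy singular values, evaluates the energy share $\lambda_i^2 / \sum_j \lambda_j^2$ of each component and the $\sqrt{\lambda_i}$ factors used to scale the meta-samples.

```python
import math


def energy_shares(singular_values):
    """Return lambda_i^2 / sum_j lambda_j^2 for each singular value."""
    total = sum(s * s for s in singular_values)
    if total == 0:
        raise ValueError("at least one singular value must be nonzero")
    return [s * s / total for s in singular_values]


def sqrt_weights(singular_values):
    """Return the sqrt(lambda_i) factors used to scale the meta-samples."""
    return [math.sqrt(s) for s in singular_values]


# Hypothetical singular values, sorted in descending order.
lams = [4.0, 2.0, 1.0]
shares = energy_shares(lams)
weights = sqrt_weights(lams)
print(shares)
print(weights)
```

Because the square root is monotonically increasing, the ordering of the weights matches the ordering of the energy shares.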

The $\ell_1$ graph was proposed by Cheng et al. [24] to measure the
similarity among samples. Inspired by their work, sparsity can be
obtained by an $\ell_1$ regularizer on underdetermined linear equation
systems. Concretely, a testing sample can be recovered by a linear
combination of weighted meta-samples with a noise
term added, formulated as Eq. (7)



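$$y = W\alpha + e = [W \;\; I]\begin{bmatrix} \alpha \\ e \end{bmatrix} \qquad (7)$$

Let $B = [W \;\; I] \in \mathbb{R}^{n \times (m'+n)}$ and $\alpha' = \begin{bmatrix} \alpha \\ e \end{bmatrix} \in \mathbb{R}^{m'+n}$, where $m'$
represents the number of meta-samples corresponding to the c classes,
$I$ is an identity matrix, and $e$ is the noise term. Alternatively, one
can solve the following minimization problem: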



$$\min_{\alpha'} \|\alpha'\|_1 \quad \text{s.t.} \quad y = B\alpha' \qquad (8)$$



Theorem 1 proves that Eq. (8) is an underdetermined linear
system. As stated in Subsection 2.2, the sparsity of an
underdetermined linear system can be automatically tuned by $\ell_1$
regularization (the first scenario). Moreover, Eq. (8) is a canonical
convex problem with equality constraints, which optimizes the
sparse representation coefficients and the noise term simultaneously.
The globally optimal solution can be obtained efficiently in polynomial
time with the CVX package [25]. Notably, the package solves the
optimization problem by dualization rather than by the interior point
method because the former is significantly faster than the latter.

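Theorem 1. Linear equation system (8) is underdetermined,
and rank(B) = n.

Proof. We can find a submatrix $I$ in $B \in \mathbb{R}^{n \times (m'+n)}$ such that
$\mathrm{rank}(I) = n \Rightarrow \mathrm{rank}(B) = n$. Since B has n rows and $m' + n > n$ columns, the system is underdetermined. This completes the proof. □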
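
The role of the identity block can be seen in a minimal sketch. The helper names and the toy matrix are hypothetical. The sketch builds B = [W I] and checks that the trivial vector with zero coefficients and noise part equal to y always satisfies the constraint of Eq. (8), so the system is consistent for every test sample; it does not perform the $\ell_1$ minimization itself.

```python
def build_dictionary(W):
    """Return B = [W I] for an n x m' matrix W given as a list of rows."""
    n = len(W)
    return [list(W[i]) + [1.0 if j == i else 0.0 for j in range(n)]
            for i in range(n)]


def matvec(B, v):
    """Multiply matrix B (list of rows) by vector v."""
    return [sum(b * x for b, x in zip(row, v)) for row in B]


# Hypothetical weighted meta-samples: n = 2 genes, m' = 3 meta-samples.
W_toy = [[0.5, 1.0, 0.0],
         [0.2, 0.0, 1.5]]
B_toy = build_dictionary(W_toy)
y_test = [0.7, -0.3]

# alpha' = [0, ..., 0, y] is always feasible because of the identity block,
# so y = B alpha' has a solution for every y.
m_prime = len(W_toy[0])
alpha_trivial = [0.0] * m_prime + y_test
print(matvec(B_toy, alpha_trivial))
```

The $\ell_1$ objective then selects, among all feasible vectors, one that explains y mostly through a few meta-samples rather than through the noise block.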

Note that $\alpha' \in \mathbb{R}^{m'+n}$ is a sparse vector with $m' + n$ entries. The
first $m'$ components correspond to linear representation
coefficients, whereas the last n components characterize model noise or
regression error. However, in most instances the test sample y from one of the
classes in the training data cannot be perfectly reconstructed by the
meta-samples associated with the same class because of
noise. Figure 3 illustrates the flowchart of our
PFMSRC scheme; the redundant dictionary is constructed by
combining the meta-samples and the noise term.

We define a projection function $\delta_i(\alpha'): \mathbb{R}^{m'+n} \to \mathbb{R}^{m'}$ for each class
i, which selects the coefficients associated with the ith class from
the first $m'$ components in $\alpha'$, whereas the other entries are
padded with zeros in $\delta_i(\alpha')$. The reconstruction
relationship $y = W\delta_i(\alpha')$ does not always hold. However, the
minimized reconstruction error criterion
$r_i(y) = \|y - W\delta_i(\alpha')\|_2$, $i = 1, \ldots, c$, is a good approximation to
classify testing samples. We summarize the proposed classification
method as follows.

Step 1. Input training sets $X = [X_1, X_2, \ldots, X_c] \in \mathbb{R}^{n \times m}$, class
number c, and testing sample $y \in \mathbb{R}^n$;

Step 2. Normalize training set samples and testing sample to
obtain unit $\ell_2$-norm;

Step 3. Extract weighted meta-samples $W = [W_1, W_2, \ldots, W_c]$ for
each class (meta-samples of the same class are grouped together);

Step 4. Solve non-parametric sparse representation problem by
Eq. (8);



Table 5. Comparison of specificity by different methods on four binary classification data sets.

| Dataset name | SRC | MSRC-SVD | PFMSRC |
|---|---|---|---|
| colon | 90.00 | 92.50 | 92.50 |
| DLBC | 96.55 | 94.83 | 96.55 |
| Gliomas | 72.73 | 77.27 | 77.27 |
| Acute leukemia | 100 | 100 | 100 |
Table 6. Comparison of sensitivity by different methods on four binary classification data sets.

| Dataset name | SRC | MSRC-SVD | PFMSRC |
|---|---|---|---|
| colon | 81.82 | 86.36 | 86.36 |
| DLBC | 1 | 1 | 94.74 |
| Gliomas | 82.14 | 78.57 | 89.29 |
| Acute leukemia | 88.00 | 92.00 | 84.00 |

doi:10.1371/journal.pone.0104314.t006



In the following section, we conduct extensive experiments
on microarray data to evaluate the effectiveness of our scheme. The
microarray data repository information and the accession
numbers are given in Table 2.

Experiments 

In this section, we evaluate the performance of the proposed
PFMSRC algorithm against four state-of-the-art algorithms,
namely, linear discriminant analysis (LDA+SVM), independent
component analysis (ICA+SVM), SRC, and meta-sample sparse
representation (MSRC-SVD). The former two are model based
and accompanied by feature extraction. These two algorithms are
regarded as baselines. For the model-based methods, a support vector
machine [26,27] with a radial basis function kernel is employed as the
classifier. The experiments are performed on four binary
classification data sets and four multiclass classification data sets.
All experiments are implemented in the Matlab environment and run
on a personal computer with an Intel Pentium 4 dual core CPU at
2.4 GHz and 4 GB of RAM. The summarized descriptions of the
eight gene expression profile datasets are provided in Table 3.



Step 5. Compute residuals for each class $r_i(y) = \|y - W\delta_i(\alpha')\|_2$, $i = 1, \ldots, c$;

Step 6. Return class label of y as $c(y) = \arg\min_i r_i(y)$, $i = 1, \ldots, c$;
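
The residual-based decision rule of Steps 5 and 6 can be sketched as follows. The function names and toy inputs are hypothetical, and the coefficient vector is assumed to contain only the meta-sample part of the solution of Eq. (8), that is, the noise entries have already been dropped.

```python
import math


def class_residuals(W_cols, class_of_column, alpha, y):
    """Return {class: ||y - W delta_i(alpha)||_2} for every class i.

    W_cols:          list of meta-sample vectors (the columns of W).
    class_of_column: class label of each column of W.
    alpha:           representation coefficients, one per column of W.
    y:               test sample.
    """
    residuals = {}
    for label in sorted(set(class_of_column)):
        # delta_i keeps coefficients of class `label` and zeros the rest.
        kept = [a if c == label else 0.0
                for a, c in zip(alpha, class_of_column)]
        recon = [sum(k * w[g] for k, w in zip(kept, W_cols))
                 for g in range(len(y))]
        residuals[label] = math.sqrt(
            sum((yg - rg) ** 2 for yg, rg in zip(y, recon)))
    return residuals


def predict_label(residuals):
    """Return the class with the smallest reconstruction residual."""
    return min(residuals, key=residuals.get)


# Hypothetical toy input: 3 meta-samples over 2 genes, two classes.
W_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [1, 1, 2]
alpha = [0.9, 0.1, 0.05]
y = [1.0, 0.2]
res = class_residuals(W_cols, labels, alpha, y)
print(res, predict_label(res))
```

In this toy case the class 1 meta-samples reconstruct y far more closely than the class 2 meta-sample, so class 1 is returned.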

PFMSRC can be considered a non-parametric version of
MSRC. Compared with MSRC, it has the following merits:

1. The weighted meta-samples are orthogonal to one another.
That is, no redundancy exists among meta-samples, and the
weight enhances the influence of the main singular vector, such
that discriminant information can be well retained.

2. The data-dependent sparsity can be automatically tuned
without human intervention. Thus, PFMSRC has better
scalability and robustness.

3. The time complexity of PFMSRC is lower than that of MSRC,
since computationally expensive model selection is not
needed for parameter optimization. Time complexity
can be estimated as: weighted meta-sample extraction step 
needs time complexity O(nm^), l\ minimization needs time 
complexity 0{{m + n)^'^), the total complexity for PFMSRC is 
0(nrrp- +m{m + nf'^). 




Figure 5. Comparison of prediction accuracy on four multiclass classification datasets by varying the number of samples per
subclass; when p is larger than 10, the performance degradation of model-based methods is less significant than that in binary
classification.

doi:10.1371/journal.pone.0104314.g005
Table 7. Comparison on four multiclass tumor data sets; for each data set, 10 (8 for LeukemiaGloub) samples per class are
randomly selected for training and the rest are used for testing.

| Dataset name | LDA+SVM | ICA+SVM | SRC | MSRC-SVD | PFMSRC |
|---|---|---|---|---|---|
| SRBCT | 91.05(±4.61) | 88.72(±5.56) | 96.86(±2.64) | 97.56(±3.06) | 96.98(±2.51) |
| ALL | 86.12(±3.81) | 91.38(±3.28) | 94.07(±2.38) | 94.07(±2.93) | 96.73(±1.68) |
| MLLLeukemia | 93.81(±3.74) | 93.33(±5.16) | 95.36(±3.04) | 95.36(±2.84) | 95.83(±2.88) |
| LukemiaGloub | 73.75(±5.25) | 77.50(±6.98) | 95.83(±2.14) | 95.21(±2.35) | 94.90(±2.74) |

The average accuracy and corresponding standard deviations are reported.
doi:10.1371/journal.pone.0104314.t007



• Colon [28] consists of 62 samples with two subclasses, including
40 tumor and 22 normal samples. The 2000 genes with the highest
minimal intensity across the tissues are retained from the original
set of more than 6500 genes. This dataset can be downloaded
from [29].

• Acute leukemia data [30] consist of 72 samples with two
subclasses, including 47 acute lymphoblastic leukemia patients
and 25 acute myelogenous leukemia patients. Each sample
contains 7129 genes. This dataset can be downloaded from
[29].

• DLBCL [1] consists of 77 samples with two subclasses,
including 58 diffuse large B-cell lymphoma samples and 19
follicular lymphoma samples. Each sample contains 7129
genes. This dataset can be downloaded from [31].

• Gliomas [32] consists of 50 samples with two subclasses
(glioblastomas and anaplastic oligodendrogliomas), and each
sample contains 2308 genes. This dataset is available at [31].

• SRBCT [33] consists of 83 samples with four subclasses (Ewing's
sarcoma, Burkitt's lymphoma, neuroblastoma, and rhabdomyosarcoma).
Each sample contains 2308 genes. This dataset is available at
[31].

• ALL [34] consists of 248 samples with six subclasses. Each
sample contains 12626 genes. This dataset is available at
[31].

• MLLLeukemia [35] consists of 72 samples with three
subclasses. Each sample contains 12582 genes. This dataset
is available at [29].

• LukemiaGloub [30] consists of 72 samples with three
subclasses. Each sample contains 7129 genes. This dataset
is available at [31].

Dataset preprocessing and experiment setup 

Gene expression profiling involves data with high dimensionality
and small sample size. The exclusion of redundant and
irrelevant data is critical for classification. As suggested by [36],
retaining only the top 400 genes makes a good tradeoff between
computational complexity and biological significance. In our
experiment, the top 400 genes are selected from each dataset by
applying the ReliefF [14] algorithm to the training set.

For the LDA+SVM algorithm, we simply extract c − 1 new features
to train the classifier, as LDA can find at most c − 1 meaningful
projection vectors in the subspace, where c denotes the number of
classes. SVM kernel parameters are determined by 10-fold
cross-validation. The number of independent components for ICA
must also be determined empirically. Here, we
use the same method as suggested by [18].

Figure 6. Comparison of prediction accuracy on four binary classification datasets by varying the number of top selected genes.

doi:10.1371/journal.pone.0104314.g006

Figure 7. Comparison of prediction accuracy on four multiclass classification datasets by varying the number of top selected genes.

doi:10.1371/journal.pone.0104314.g007

SRC and MSRC methods need the parameter $\lambda$ to control sparsity.
MSRC also needs the number of meta-samples of each class as a
key parameter. For each dataset, $\lambda$ is searched over
{0.001,0.1,1,10,100} by 10-fold CV on the training data, and the
number of meta-samples for each class is set as recommended by
[8].

Experiments on binary classification problem 

To evaluate the performance of the five methods on balanced splits
of the data, we randomly select p samples per subclass as the
training set, with p ranging from 5 to $\min_i(|C_i|) - 1$, and use the
rest for testing; this guarantees that at least one sample in each
category can be used for testing. The training/testing split is
randomly repeated 20 times, and the average classification
accuracies are presented. The best prediction accuracy is in
boldface for each gene expression profile dataset.
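
The balanced splitting protocol can be sketched as follows. The function name, the toy labels, and the fixed random seed are hypothetical and serve only to make the illustration reproducible; the original experiments were run in Matlab.

```python
import random


def balanced_split(labels, p, rng):
    """Pick p training indices per class at random; the rest are test."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train = []
    for label, members in by_class.items():
        if p >= len(members):
            raise ValueError("p must leave at least one test sample per class")
        train.extend(rng.sample(members, p))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


# Hypothetical labels: 6 samples of class 0 and 4 of class 1.
toy_labels = [0] * 6 + [1] * 4
rng = random.Random(0)
splits = [balanced_split(toy_labels, 3, rng) for _ in range(20)]
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))
```

Averaging a metric over the 20 splits then gives the reported mean accuracy and its standard deviation.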

We show the average performance comparison on four binary 
classification tasks in Figure 4. PFMSRC exhibited encouraging 
performance. Although Gliomas was difficult for classification, the 
proposed approach can still achieve 85% classification accuracy 
via 20 samples per subclass used for training. Notably, the 
classification accuracy of LDA+SVM and ICA+SVM dropped 
quickly as more samples are taken for training; the same 
observations can be found in [36]. This fluctuation phenomenon 
can be interpreted as follows: (1) For the binary classification case, 
the feature extracted by LDA has only one dimension that is 
insufficient to capture the intrinsic discriminating information. 
Thus, model-based classification methods have difficulty in 
preventing the over-fitting phenomenon. (2) The number of samples
available for evaluation in the testing set changes
as more samples are used for training.



Table 8. The maximal average prediction accuracy of LDA+SVM, ICA+SVM, SRC, MSRC-SVD and PFMSRC on eight tumor
microarray datasets.

| Dataset name | LDA+SVM | ICA+SVM | SRC | MSRC-SVD | PFMSRC |
|---|---|---|---|---|---|
| colon | 61.67 | 76.90 | 80.48 | 84.05 | 83.81 |
| DLBC | 68.07 | 71.05 | 89.47 | 88.42 | 89.47 |
| Gliomas | 67.33 | 70.67 | 75.33 | 75.00 | 76.00 |
| Acute leukemia | 85.38 | 88.85 | 93.27 | 95.00 | 95.19 |
| SRBCT | 91.16 | 89.30 | 97.21 | 97.21 | 97.21 |
| ALL | 85.16 | 91.44 | 96.46 | 93.59 | 97.02 |
| MLLLeukemia | 96.43 | 94.05 | 96.43 | 96.67 | 97.14 |
| LukemiaGloub | 81.63 | 91.81 | 94.79 | 94.68 | 96.05 |

doi:10.1371/journal.pone.0104314.t008
Table 9. 10-fold CV prediction accuracy of eight tumor microarray datasets using different classification methods.

| Dataset name | LDA+SVM | ICA+SVM | SRC | MSRC-SVD | PFMSRC |
|---|---|---|---|---|---|
| colon | 81.67 | 90.00 | 87.14 | 90.24 | 90.24 |
| DLBCL | 92.14 | 97.14 | 97.14 | 91.96 | 95.89 |
| Gliomas | 86.50 | 86.50 | 78.33 | 78.33 | 84.00 |
| Acute leukemia | 96.50 | 95.57 | 96.07 | 97.50 | 95.00 |
| SRBCT | 96.64 | 95.75 | 1 | 1 | 1 |
| ALL | 97.61 | 94.83 | 96.46 | 93.59 | 97.63 |
| MLLLeukemia | 95.65 | 95.89 | 98.75 | 98.75 | 97.32 |
| LukemiaGloub | 97.32 | 96.32 | 98.57 | 98.57 | 96.07 |

doi:10.1371/journal.pone.0104314.t009



Classification accuracy, specificity, and sensitivity are 
popular evaluation metrics. In this work, we use all three to 
evaluate performance, and the results are reported in Tables 4, 5, 
and 6, respectively. The three methods achieve satisfactory 
performance on both the specificity metric and the sensitivity 
metric. PFMSRC outperforms SRC and MSRC in most cases. 
Overall, PFMSRC achieves the best performance, followed by 
MSRC and SRC. 
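
For reference, the three metrics can be computed from the counts of a binary confusion matrix. The following is a minimal sketch, not the code used in the experiments; the function name `binary_metrics` and the toy labels are hypothetical and serve only as an illustration. 

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: fraction of positive samples predicted as positive.
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    # Specificity: fraction of negative samples predicted as negative.
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


# Hypothetical toy labels (1 = tumor, 0 = normal), not taken from the paper.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
acc, sens, spec = binary_metrics(y_true, y_pred)
```

Reporting sensitivity and specificity alongside accuracy matters when the two classes have different sizes, because accuracy alone can hide poor performance on the smaller class. 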

Experiments on multiclass classification problem 

We investigate multiclass classification performance on four 
publicly available datasets. The experimental setup is the same as 
that for the binary classification case. On the one hand, Figure 5 
and Table 7 show that (1) the classification accuracies of 
SRC, MSRC, and PFMSRC increase on all multiclass 
classification datasets as more samples per subclass are taken for 
training. (2) ALL has six subclasses, and the proposed PFMSRC 
achieves the highest classification accuracy on it, which indicates 
that PFMSRC is potentially superior on multiclass classification 
tasks. (3) LDA can capture more discriminating information on the 
multiclass classification task, and the over-fitting phenomenon is 
reduced compared with the binary classification task. 

On the other hand, sparse representation based classification 
methods are less sensitive to the number of training samples than 
model-based classification methods, which suggests a 
natural approach to selecting a classifier when the training sample 
size is small. Table 7 summarizes the performance of the five 
classification methods. The proposed PFMSRC method performs 
consistently well with small standard deviations. On the SRBCT 
and ALL datasets, PFMSRC achieved 96.98% and 96.73%, 
respectively. 

Experiments with different numbers of genes 

In this subsection, we evaluate the performance of the five 
methods with different feature dimensions on eight tumor 
datasets. For the training data, 10 samples per subclass are randomly 
selected, whereas the remaining samples are used for testing. We 
perform the test with various numbers of genes, from 50 to 
400 genes in steps of 20. The comparison experiment was 
performed 20 times, and the average prediction accuracy on the 
eight gene expression profile datasets was recorded 
for evaluation. 
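
The protocol above can be sketched as follows. This is an illustrative outline under stated assumptions, not the original experimental code: `balanced_split`, `repeated_evaluation`, and the callback `evaluate(train, test, n_genes)` are hypothetical names, and the gene selection and classifier are assumed to live inside that callback. Note that a grid starting at 50 with a step of 20 ends at 390, so the exact handling of the 400-gene endpoint is an assumption here. 

```python
import random
from collections import defaultdict


def balanced_split(labels, n_train_per_class, rng):
    """Pick n_train_per_class random samples per subclass for training;
    all remaining samples form the test set. Returns index lists."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab][:]
        rng.shuffle(idxs)
        train.extend(idxs[:n_train_per_class])
        test.extend(idxs[n_train_per_class:])
    return sorted(train), sorted(test)


def repeated_evaluation(labels, evaluate, n_train_per_class=10,
                        n_repeats=20, gene_counts=range(50, 401, 20), seed=0):
    """Average evaluate(train, test, n_genes) over random balanced splits."""
    rng = random.Random(seed)
    totals = {g: 0.0 for g in gene_counts}
    for _ in range(n_repeats):
        train, test = balanced_split(labels, n_train_per_class, rng)
        for g in gene_counts:
            totals[g] += evaluate(train, test, g)
    return {g: totals[g] / n_repeats for g in gene_counts}


# Hypothetical labels and a placeholder scoring callback for illustration.
labels = ["A"] * 25 + ["B"] * 15
avg = repeated_evaluation(labels, lambda tr, te, g: len(te) / len(labels))
```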

The balanced training sets for each dataset ensure fair 
evaluation, as stated in [36]. The experimental results in Figure 6 
show that the proposed PFMSRC performs well when only 100 
genes are used. Similar results can be observed in the multiclass 
classification case as well. 

In the binary classification case, SRC, MSRC, and PFMSRC share 
the same curve trend. PFMSRC performs well with a smaller 
number of genes, whereas SRC and MSRC achieve comparable 
accuracy by using more genes. Evidently, SRC, MSRC, and 
PFMSRC consistently outperform LDA+SVM and ICA+SVM on 
all datasets. 

In the multiclass classification case, the performance of MSRC, 
SRC, and PFMSRC is very stable with respect to the number of 
genes, and all these methods converge fast to the optimal 
classification rate. Figure 7 shows that, compared with their 
performance in the binary classification case, SRC, MSRC, and 
PFMSRC are less influenced by gene dimension. Note that ALL is 
a multiclass dataset with six subclasses, but PFMSRC still 
achieves a higher classification accuracy of 97% than 
SRC and MSRC. The same conclusion can be drawn for the 
SRBCT dataset. 

In Table 8, we report the detailed classification accuracy. 
PFMSRC outperforms its competitors on most gene expression 
profile datasets, whereas SRC and MSRC-SVD perform the 
second best. 

Comparison of CV performance 

To evaluate the classification performance on imbalanced 
training/testing splits, we perform 10-fold stratified CV on the 
tumor subtype datasets. All samples are randomly divided into 10 
subsets based on stratified sampling: nine subsets are used for 
training, and the remaining subset is used for testing. This 
evaluation process is repeated 10 times, and the average result is 
presented. The 10-fold CV results are summarized in Table 9. 
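
A stratified 10-fold split of this kind can be sketched as below. This is a minimal illustration, not the original code: `stratified_folds`, `cross_validate`, and the callback `evaluate(train, test)` are hypothetical names, and the sketch assumes that "repeated 10 times" means ten independent reshuffles of the whole CV procedure. 

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Deal the shuffled samples of each class round-robin into k folds,
    so that every fold keeps roughly the original class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        for j, idx in enumerate(idxs):
            folds[(offset + j) % k].append(idx)
        offset += len(idxs)  # continue dealing where the last class stopped
    return folds


def cross_validate(labels, evaluate, k=10, n_repeats=10):
    """Average evaluate(train, test) over all folds and all repetitions."""
    scores = []
    for rep in range(n_repeats):
        folds = stratified_folds(labels, k, seed=rep)
        for i, test in enumerate(folds):
            train = [idx for j, f in enumerate(folds) if j != i for idx in f]
            scores.append(evaluate(train, test))
    return sum(scores) / len(scores)


# Hypothetical two-class label vector (40 vs. 22 samples) for illustration.
labels = [0] * 40 + [1] * 22
folds = stratified_folds(labels, k=10)
```

Each sample appears in exactly one test fold, so roughly 90% of the samples are available for training in every round. 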

Table 9 shows that as the training sample size increases, the 
performance of all five classification methods improves 
significantly. The model-based methods LDA+SVM and ICA+SVM 
perform very well, with significantly increased classification 
accuracy. In particular, the prediction accuracy of ICA+SVM 
ranges from 86.5% to 96.57% across the tumor expression profile 
datasets, which is comparable with those of SRC, MSRC, and 
PFMSRC. 

We can conclude that model-based approaches are more 
vulnerable to the small sample size problem and that over-fitting 
should be properly addressed. 

Discussion 

Based on the above experiments, we can draw the following 
observations: 
1. Sparse representation based methods (SRC, MSRC, 
PFMSRC) consistently outperform the model-based methods 
(LDA+SVM, ICA+SVM) on all experiments. In particular, on the 
balanced split datasets the prediction accuracy of model-based 
methods is significantly lower than that of sparse representation 
methods, which may be attributed to the small sample size 
problem. In contrast, SRC, MSRC, and PFMSRC perform well 
even when only 5 samples per subclass are taken for training and 
the rest for testing. 

2. SRC, MSRC, and PFMSRC are robust to various sample sizes 
and feature dimensions, and converge fast to the optimal 
classification rate. The experiments verify the results in [7], 
which favors the application of those methods. Note that the 
model-based methods (LDA+SVM, ICA+SVM) exhibit 
improved 10-fold CV classification accuracy. A reasonable 
explanation is that the over-fitting phenomena are dramatically 
reduced when 90% of the original samples are used for training 
and the remaining 10% are used for evaluation in our 
experiments. 

3. PFMSRC outperforms SRC and MSRC in most cases, which 
implies that the parameter-free sparse representation and 
weighting strategies can capture more discriminating 
information, especially in multiclass classification (see Figure 5). 

4. PFMSRC is a parameter-free method in which the data 
dependent sparsity can be self-adaptively tuned, whereas in 
SRC and MSRC the search for a regularization 
parameter is laborious work. Moreover, the number of 
meta-samples is a key parameter for MSRC, as shown in Figure 2, 
which makes model selection more difficult. 